LABORATORY FOR
COMPUTER SCIENCE
MASSACHUSETTS
INSTITUTE OF
TECHNOLOGY
MIT/LCS/TR-495
THE SPECTRAL NORM OF
FINITE FUNCTIONS
Mihir Bellare
February 1991
545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139
This document was written for my area exam. The assigned papers were [FJS], [KM] and [SB].
The Spectral Norm of Finite Functions
Mihir Bellare
Laboratory for Computer Science
Massachusetts Institute of Technology
February 1, 1991
Abstract
In many recent results in learning and computational complexity theory which rely on Fourier
analysis, the spectral norm plays a key role. An understanding of this quantity would appear to be
useful in both gauging and exploiting these results, and in understanding the underlying techniques.
This paper surveys various aspects of the spectral norm of finite functions. We consider some
of the motivating results as well as both upper and lower bounds on the spectral norm and their
relationships to the computational complexity of the function.
Some of the results included here are new. In particular, we introduce a general technique for
upper bounding the spectral norm of a decision tree over an arbitrary basis. We also extend the
learning algorithm of Kushilevitz and Mansour [KM] to mutually independent distributions.
Keywords: Fourier series, spectral norm, learning, decision trees.
Contents
1 Introduction 2
2 Basics 3
3 Two Motivating Results 4
3.1 Efficient Learnability and the Spectral Norm 4
3.2 Sparse Representations and the Spectral Norm 6
4 Upper Bounds: Functions of Small Norm 7
5 The Spectral Norm of Decision Trees 8
5.1 A General Framework 8
5.2 Some Applications: Decision Trees of Small Norm 10
6 Lower Bounds: Functions of Large Norm 11
6.1 General Functions 11
6.2 Boolean Functions 11
6.3 An Exponential Lower Bound for Majority 12
7 Generalization: q-norms 14
7.1 The q-Framework 14
7.2 Efficient Learnability and the q-Norm 16
7.3 The q-norm of Decision Trees 18
7.4 L_q versus L 18
A Appendix: Proof of the Combinatorial Identity 21
1 Introduction
Most functions are hard to learn efficiently. Functions of small spectral norm, say Kushilevitz and
Mansour [KM], are an exception: they can be learned in polynomial time.
Moreover these functions can be approximated by sparse Fourier series [SB]. They are also com-
putable by threshold circuits of very small depth [Br],[HMPST],[BS],[SB]. And so on: the list of their
virtues is a long one.
Small is indeed beautiful. And that is just one reason to be interested in spectral norms.
Fourier Series and the Spectral Norm. The Fourier series of a function f : {0,1}^n → R
is obtained by expressing it as a linear combination of certain appropriately chosen basis functions:
f = Σ_{z∈{0,1}^n} f̂(z)χ_z, where the χ_z : {0,1}^n → {-1,+1} are the basis functions and the f̂(z) are real
numbers depending on f. The spectral norm of f, denoted L(f), is the sum of the absolute values of
the coefficients in this linear combination: L(f) = Σ_{z∈{0,1}^n} |f̂(z)|.
Fourier series have of late proved to be a very useful tool in learning and computational complexity
theory. And as the above indicates, the spectral norm often plays a pivotal role.
It is not just that small is beautiful. If one examines these results more closely, one realizes that
the spectral norm is a kind of "yardstick," its value for a given function measuring the "complexity"
of this function with respect to the task at hand. For example, what the [KM] learning result really
says is that a function is learnable in time polynomial in its spectral norm. The sparse approximation
result [SB] really says that one can approximate a function by a Fourier series having a number of
terms proportional to the spectral norm of the function. And so on.
The value of the spectral norm of a function emerges as a key to many of its computational
properties. In order to exploit results such as those listed above, it becomes important to know more
about this value. In particular, which functions have small spectral norm? Which don't? How does
this relate to their complexity in various models of computation? These are the kinds of questions on
which I will focus.
This Report. I begin by describing a couple of the motivating results mentioned above.
Next I look at upper bounds on the spectral norm, where what one is really interested in, of course,
is to identify the functions of small norm. Here I present two very simple examples.
I then look at the spectral norm of decision trees, an interesting case of how the spectral norm of
a function can relate to its complexity in some model of computation.
In §6 I look at lower bounds, presenting two examples of functions whose spectral norm is close to
the highest possible value for any boolean function.
Finally, I turn to a generalization of the usual spectral norm which I call the q-norm. Results
about the q-norm being much scarcer in the literature (it is a recent invention of Furst, Jackson and
Smith [FJS]), I will spend more time here on developing techniques and applications. I conclude by
considering briefly the relative merits of the two norms.
What's New. Among the results and proofs included here, some are new. The principal amongst
these are
• A technique for upper bounding the spectral norm of a decision tree over an arbitrary basis.
I introduce a general framework which enables one to compute, in a simple manner, a weight
which for any decision tree is an upper bound on its spectral norm. This extends the result of [KM]
which applied only to decision trees over the parity basis. As a consequence, many new classes of
decision trees become learnable in polynomial time.
• An extension of the [KM] learning algorithm to mutually independent distributions.
I generalize the proof of [KM] to show that their algorithm works to learn functions in time
polynomial in their q-norm when the error of a hypothesis is measured by its proximity to the
target under the mutually independent distribution q.
• A new proof of an exponential (and almost optimal) lower bound on the spectral norm of the
majority function.
This is only new in the sense that I have not seen any other proof (however I know that the
result is known: I have just been unable to track down the reference).
Apologia. It being the first time I have looked at Fourier analysis, I have not hesitated to state
the obvious. In order to tie results together and set the stage, I have often had to prove small things
which are not in the papers only, I'm sure, because they are considered too basic.
I include my proofs anyway. To disambiguate, I have followed the convention of citing in the
theorem statement the source from which I took it: if nothing is cited it means I couldn't find it
anywhere.
Finally, I must apologize for the many things I have omitted. It seemed better to stick to a single
thread than to try to compile a compendium of results in the assigned papers, and of course much
that is interesting is lost.
2 Basics
The set of real valued functions on {0,1}^n is a 2^n-dimensional vector space over the reals. Choosing
the "right" basis for this vector space yields the Fourier series.
Here I'll go through the basics, as briefly as possible. I'll give the definitions and a minimum of
the rich set of properties of Fourier series, developing more as the need arises in later sections.
Notation and Conventions. I identify a string z ∈ {0,1}^n with the subset of {1, ..., n} whose
characteristic vector is z. With this convention, I will apply set-theoretic notation to strings, using
expressions like "i ∈ z" or "y ⊕ z" where y, z are n bit strings and 1 ≤ i ≤ n.
The empty string is denoted λ. We adopt the (usual) convention that {0,1}^0 = {λ}.
Boolean functions for us mean functions whose range is {-1,+1}.
Fourier Series. We define the inner product of f, g : {0,1}^n → R by
⟨f, g⟩ = 2^{-n} Σ_{x∈{0,1}^n} f(x)g(x).
Note that ⟨f, g⟩ = E[fg] where the expectation is over the uniform distribution on the inputs.
The norm associated to this inner product is ||f|| = √⟨f, f⟩.
The parity functions are the functions χ_z : {0,1}^n → {-1,+1} defined for z ∈ {0,1}^n by
χ_z(x) = Π_{i∈z} (-1)^{x_i}.
{χ_z}_{z∈{0,1}^n} is an orthonormal basis for the vector space of real valued functions on {0,1}^n (the
orthonormality is with respect to the inner product defined above).
Suppose f : {0,1}^n → R. Then f = Σ_{z∈{0,1}^n} f̂(z)χ_z where f̂(z) = ⟨f, χ_z⟩. This (unique)
expansion of f in terms of the parity basis is its Fourier series. The Fourier coefficients of f are the
real numbers f̂(z) (z ∈ {0,1}^n). The sequence of Fourier coefficients is also called the spectrum of f.
The Fourier transform is the operator F defined by F(f) = f̂. This operator is linear: F(af + g) =
aF(f) + F(g) for real a and functions f, g : {0,1}^n → R.
The spectral norm (or just norm) of f is L(f) = Σ_{z∈{0,1}^n} |f̂(z)|. In the context of norms, "small"
means polynomially bounded (as a function of n) and "large" means not small.
Parseval's identity
Σ_{z∈{0,1}^n} f̂(z)² = ||f||²
follows directly from the orthonormality of the basis. It follows that if f is boolean then
Σ_{z∈{0,1}^n} f̂(z)² = 1. A consequence of this last fact is the following
Proposition 2.1 If f is boolean then L(f) ≤ 2^{n/2}.
A useful property of the parity functions is that χ_a χ_b = χ_{a⊕b}.
Some Intuition for the Boolean Case. Some intuition about what the Fourier coefficients of
a boolean function f measure is provided by Linial, Mansour and Nisan [LMN]. They observe that
f̂(z) = Pr[f(x) = χ_z(x)] - Pr[f(x) ≠ χ_z(x)] (the probabilities being over a random choice of the input
x). Thus one can think of the Fourier coefficient f̂(z) as measuring the ability of f to approximate
the parity of the bits in z when the input is chosen at random.
Matrix Formulation. The following formulation of the Fourier transform in terms of matrices is
often useful.
The Sylvester type Hadamard matrices are defined inductively by H_0 = [1] and
H_{n+1} = [ H_n   H_n ]
          [ H_n  -H_n ].
Let [f] denote the vector (of length 2^n) whose components are the values of f(x) in lexicographic
order. Then
Proposition 2.2 [Br] Let f : {0,1}^n → R. Then [f̂] = 2^{-n} H_n [f].
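As a sanity check on these definitions, here is a small brute-force Python sketch (the function names are my own, and the computation is exponential in n, so it is only for tiny examples). It computes the Fourier coefficients and the spectral norm directly from the definitions above:

```python
from itertools import product

def chi(z, x):
    # Parity basis function: chi_z(x) = prod_{i in z} (-1)^{x_i},
    # where z and x are bit tuples and z is read as a characteristic vector.
    return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

def fourier_coefficients(f, n):
    # f_hat(z) = <f, chi_z> = 2^{-n} * sum_x f(x) * chi_z(x)
    points = list(product((0, 1), repeat=n))
    return {z: sum(f(x) * chi(z, x) for x in points) / 2 ** n for z in points}

def spectral_norm(f, n):
    # L(f) = sum_z |f_hat(z)|
    return sum(abs(c) for c in fourier_coefficients(f, n).values())
```

For example, f(x) = (-1)^{x_1 ⊕ x_2} is itself a parity, so its only nonzero coefficient is 1 and L(f) = 1; and for any boolean f the squared coefficients sum to 1, as Parseval's identity demands.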
3 Two Motivating Results
Fourier analysis has of late proved to be a very useful tool in learning and computational complexity
theory. And in many recent results, the spectral norm plays a pivotal role. Here are two examples.
3.1 Efficient Learnability and the Spectral Norm
Learning is hard. How hard? The spectral norm of the target function, say Kushilevitz and Mansour
[KM], gives us some indication. Let us see how.
The Model. The learning algorithm A receives as input 1^n, ε, δ and, as an oracle, the function
f : {0,1}^n → {-1,+1} which it is trying to learn (the latter means that as part of its normal
operation the algorithm can query the value of f on an input x and receive the answer in one step).

A_B(1^n, ε, δ; f)
    S ← Coef(1^n, λ, εB(n)^{-1}; f)
    h(·) ← sign(Σ_{z∈S} f̂(z)χ_z)
    return h

Coef(1^n, α, θ; f)
    { Returns a superset of { αβ : |f̂(αβ)| ≥ θ } }
    if α ∈ {0,1}^n then return {α}
    else if E[f_α²] ≥ θ² then return Coef(1^n, α0, θ; f) ∪ Coef(1^n, α1, θ; f)
    else return ∅

Figure 1: The learning algorithm A_B
The output of A on these inputs is (an encoding of) a hypothesis h : {0,1}^n → R. The error of the
hypothesis with respect to the input f is Err_h(f) = Pr[f(x) ≠ h(x)] (the probability is over a random
choice of x). We can now say what it means to learn a concept class.
Definition 3.1 For each n let C^n be a collection of functions from {0,1}^n → {-1,+1}. We say that
the concept class C = ∪_{n≥1} C^n is learnable if there exists an algorithm A (of the kind described above)
such that Pr[Err_{A(1^n,ε,δ;f)}(f) ≤ ε] ≥ 1 - δ for every f ∈ C^n and every n and ε, δ > 0 (the probability
is over the coin tosses of the algorithm).
Note that this is not distribution free learning in the sense of Valiant [Va]. Not only is the algorithm
allowed membership queries but also the error of the hypothesis is measured with respect to the
uniform distribution on the inputs.
The Result. Let C_B = ∪_{n≥1} C_B^n where C_B^n is the class of functions from {0,1}^n → {-1,+1} whose
spectral norm is bounded above by B(n).
Theorem 3.2 [KM] Let B : N → N. Then the concept class C_B is learnable in time polynomial in
B(n), n, ε^{-1} and log δ^{-1}.
Broadly speaking, this says that a concept class is learnable in time polynomial in a bound on the
spectral norm of the functions in the class.
The Algorithm. The learning algorithm is presented in Figure 1. Let me explain, very briefly, what
is happening.
The idea is to approximate f by the sum of a small number of "special" terms of its Fourier
series. These special terms are those which have a high Fourier coefficient: specifically, one can show
that if S ⊆ {0,1}^n includes all strings z satisfying |f̂(z)| ≥ εB(n)^{-1} then Σ_{z∈S} f̂(z)χ_z is a good
approximation to f.
Call z special if |f̂(z)| ≥ θ. The main component of the algorithm is a subroutine Coef which
finds all the special terms of f: Coef(1^n, λ, θ; f) returns a set of n bit strings containing all the special
ones (λ is the empty string).
Coef works recursively, building up the special strings from the empty string: if α ∈ {0,1}^k then
Coef(1^n, α, θ; f) returns (a superset of) the set of all special strings which have prefix α. The routine
looks at the function f_α : {0,1}^{n-k} → R defined by f_α(x) = Σ_{β∈{0,1}^{n-k}} f̂(αβ)χ_β(x): the value
of E[f_α²] is used to prune the search and keep the running time polynomial.
Finally, let me remark that the algorithm as stated is somewhat inaccurate in that it assumes
we can compute both E[f_α²] and f̂(z). The truth, however, is that both can be sufficiently well
approximated for our purposes.
Proofs? They await §7.2 where I will show something more general. The correctness of the above
will follow as a special case.
3.2 Sparse Representations and the Spectral Norm
One of the drawbacks of the Fourier series, from a computational point of view, is that it is just too
large: the series, after all, has an exponential number of terms. So we seek to approximate functions
by sparser Fourier series. But how sparse a series can one get? The answer, again, is in the spectral
norm. The result of Siu and Bruck [SB] that I will describe here shows the existence of approximations
to a function / by a Fourier series having a number of terms which is polynomial in the spectral norm
of/.
Applications abound, particularly in the area of threshold circuit complexity [Br], [HMPST], [BS],
[SB].
The Result. The theorem of Siu and Bruck [SB] improves on work in [BS]. The proof I will present
here is a slightly modified version of the one in [SB]: I make the probabilistic method more explicit
and use Chernoff bounds to avoid the variance calculation.
Lemma 3.3 (Chernoff Bound) Let X_1, ..., X_m be independent, mean zero random variables in the range
[-1,1] and let X = Σ_{i=1}^m X_i. Then
Pr[|X| ≥ λ] ≤ 2e^{-λ²/2m}
for any λ > 0.
Note: Siu and Bruck [SB] assume the domain of their functions is {-1,+1}^n, rather than {0,1}^n as
I assume here. So the statement of the following theorem will look a little different from theirs. It is
however easy to see that the two formulations are equivalent.
Theorem 3.4 [SB] Let f : {0,1}^n → {-1,+1} and let k > 0. Then there exists a set S ⊆ {0,1}^n
and integers M, w_z (z ∈ S) such that |f(x) - g(x)| ≤ n^{-k} where g : {0,1}^n → R is defined by
g(x) = (1/M) Σ_{z∈S} w_z χ_z(x). Moreover |S|, M are ≤ O(n^{2k+1} L(f)²) and w_z ∈ {-1,+1} (z ∈ S).
Proof: The proof uses the probabilistic method. Define Ω_i = {0,1}^n ∪ {⊥_i} with
Pr_{Ω_i}(z) = |a_z|/B if z ∈ {0,1}^n, and 1 - Σ_{a∈{0,1}^n} |a_a|/B otherwise,
where B = ⌈L(f)⌉ and a_z = f̂(z). Our sample space will be Ω = Π_{i=1}^N Ω_i with the induced probability
measure. Now define random variables
Z_{x,i}(z^1, ..., z^N) = sign(a_{z^i}) χ_{z^i}(x) if z^i ∈ {0,1}^n, and 0 otherwise,
and let G_x = (B/N) Σ_{i=1}^N Z_{x,i}. Then
Pr[|G_x - f(x)| ≥ n^{-k}] = Pr[ | Σ_{i=1}^N Z_{x,i} - N f(x)/B | ≥ N n^{-k}/B ],
and noting that E[Z_{x,i}] = f(x)/B this is ≤ 2e^{-N n^{-2k}/8B²} by Lemma 3.3. Choosing N = 16B²n^{2k+1}
this is < 2^{-n} and thus
Pr[|G_x - f(x)| ≤ n^{-k} for all x ∈ {0,1}^n] > 0.
Thus there is a point (z^1, ..., z^N) in our sample space such that |G_x(z^1, ..., z^N) - f(x)| ≤ n^{-k} for all
x ∈ {0,1}^n, and we can conclude the proof by setting g = (B/N) Σ_{i=1}^N sign(a_{z^i}) χ_{z^i} (actually the sum is
only over the i for which z^i is in {0,1}^n). ∎
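The sampling step of the proof can be rendered directly in code. The following Python fragment (my own names; a toy illustration of the argument, not [SB]'s construction verbatim) draws each term z with probability |f̂(z)|/B, with the leftover probability going to a dummy outcome that contributes nothing, and averages the signed parities:

```python
import math
import random

def sparse_approximation(fhat, N, rng):
    # fhat: dict from bit tuples z to coefficients.  Each of the N samples
    # is z with probability |fhat[z]|/B, where B = ceil(L(f)); with the
    # remaining probability a dummy (None) is drawn, contributing 0.
    B = math.ceil(sum(abs(c) for c in fhat.values()))
    zs = list(fhat)
    weights = [abs(fhat[z]) / B for z in zs]
    zs.append(None)
    weights.append(1.0 - sum(weights))      # the dummy outcome
    samples = rng.choices(zs, weights=weights, k=N)

    def g(x):
        # g(x) = (B/N) * sum_i sign(fhat[z^i]) * chi_{z^i}(x)
        total = 0.0
        for z in samples:
            if z is not None:
                total += math.copysign(1.0, fhat[z]) * (-1) ** sum(
                    zi & xi for zi, xi in zip(z, x))
        return B * total / N
    return g
```

When f is itself a single parity the sample distribution is concentrated on that one term and g reproduces f exactly; in general the theorem's Chernoff argument controls how large N must be.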
The reader is referred to [SB] for applications to the construction of small depth threshold circuits.
4 Upper Bounds: Functions of Small Norm
Convinced that functions of small norm are nice functions, the first thing we would like to do is
certainly to find examples of them. In this section I present two simple examples.
AND and OR. For any z ∈ {0,1}^n we define AND_z, OR_z : {0,1}^n → {-1,+1} by
AND_z(x) = (-1)^{∧_{i∈z} x_i}
OR_z(x) = (-1)^{∨_{i∈z} x_i}.
Proposition 4.1 L(AND_z), L(OR_z) ≤ 3.
Proof: It is not too hard to compute explicitly the complete spectrum of these functions. Details
omitted. ■
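The omitted computation is easy to carry out. For |z| = k ≥ 1 the spectrum works out to one coefficient of magnitude 1 - 2^{1-k} and 2^k - 1 coefficients of magnitude 2^{1-k}, giving L(AND_z) = L(OR_z) = 3 - 2^{2-k} < 3. A quick brute-force check in Python (helper names mine):

```python
from itertools import product

def spectral_norm(f, n):
    # Brute-force L(f) = sum_z |2^{-n} sum_x f(x) chi_z(x)|
    pts = list(product((0, 1), repeat=n))
    return sum(abs(sum(f(x) * (-1) ** sum(zi & xi for zi, xi in zip(z, x))
                       for x in pts)) / 2 ** n
               for z in pts)

def and_z(z):
    # AND_z(x) = -1 exactly when every bit of x selected by z equals 1
    return lambda x: -1 if all(xi == 1 for zi, xi in zip(z, x) if zi) else 1

def or_z(z):
    # OR_z(x) = -1 exactly when some bit of x selected by z equals 1
    return lambda x: -1 if any(xi == 1 for zi, xi in zip(z, x) if zi) else 1
```

For k = 3 both norms come out to 3 - 2^{-1} = 2.5, and for k = 1 both functions are parities of norm 1.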
Comparison. Sometimes it is possible to show that a function has small norm by exploiting some
intrinsic property of it (for example, a convenient inductive structure). The comparison function is an
example [SB].
The comparison function C_n : {0,1}^n → {-1,+1} is defined for even n by
C_n(xy) = 1 if x ≥ y, and -1 otherwise,
where x, y ∈ {0,1}^{n/2} are identified with the integers Σ_{i=1}^{n/2} x_i 2^{n/2-i} and Σ_{i=1}^{n/2} y_i 2^{n/2-i} respectively.
Proposition 4.2 [SB] Let n be even. Then L(C_n) = n/2 + 1.
Proof: The proof is by induction on k = n/2. For the base case k = 1 it is easily verified that
C_2 = ½χ_00 + ½χ_01 - ½χ_10 + ½χ_11 and hence L(C_2) = 2. For k ≥ 1 we note that
C_{2k+2}(xy) = [(-1)^{y_1} - (-1)^{x_1}]/2 + [1 + (-1)^{x_1+y_1}]/2 · C_{2k}(x_2...x_{k+1} y_2...y_{k+1}).
It follows that L(C_{2k+2}) = L(C_{2k}) + 1. ∎
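Proposition 4.2 is easy to confirm numerically for small n. The following sketch (names mine) evaluates C_n directly and computes its norm by brute force:

```python
from itertools import product

def spectral_norm(f, n):
    # Brute-force L(f), straight from the definition
    pts = list(product((0, 1), repeat=n))
    return sum(abs(sum(f(x) * (-1) ** sum(zi & xi for zi, xi in zip(z, x))
                       for x in pts)) / 2 ** n
               for z in pts)

def comparison(n):
    # C_n(xy) = 1 if x >= y else -1; the two halves of the input are read
    # as binary integers, most significant bit first.
    k = n // 2
    def f(bits):
        x = sum(b << (k - 1 - i) for i, b in enumerate(bits[:k]))
        y = sum(b << (k - 1 - i) for i, b in enumerate(bits[k:]))
        return 1 if x >= y else -1
    return f
```

For n = 2, 4, 6 the computed norms are 2, 3, 4, matching n/2 + 1.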
5 The Spectral Norm of Decision Trees
A particularly interesting aspect of spectral norms is how they can relate to the computational com-
plexity of the function in some model of computation. The first (to my knowledge) example is the
result of [KM] on parity decision trees.
Here I present a general framework for upper bounding the spectral norm of an arbitrary decision
tree in terms of the spectral norms of the functions in its nodes. Extending the result of [KM], this
yields many examples of classes of decision trees of small norm.
5.1 A General Framework
The Model. A decision tree is a binary tree in which each node is labeled by a boolean function
{0,1}^n → {-1,+1}. The tree computes (or defines) a function f : {0,1}^n → {-1,+1} in the natural
way. On any input x ∈ {0,1}^n, we start at the root and compute the value at x of the function g by
which the root is labeled. We branch according to this value: to the left if it was -1 and to the right
if it was 1. We continue in this way until we hit a leaf. Here we compute the value at x of the function
g by which this leaf is labeled, and this value is the value of the function f on input x.
Note that we do not for now restrict the functions labeling the nodes: any boolean function is
allowed.
I will identify the decision tree with the function it computes. Thus I will speak of the Fourier
transform or the spectral norm of a decision tree, and write things like T̂(z) or L(T) correspondingly.
Similarly I identify a node with the function which labels it and speak of the Fourier transform or
norm of a node.
Finally, |T| will denote the number of nodes in T.
Degrees. We assign to each function a number which we will call its degree. The degrees of the
constituent nodes will enable us to measure the spectral norm of a decision tree. The definition itself
is very simple: we let deg(g) = ½[L(g) + 1]. So the degrees are, strictly speaking, the spectral norms
themselves: it is just more convenient to translate slightly.
Note that L(g) ≥ 1 for any boolean g by Proposition 6.1. So 1 is also the minimal degree of any
boolean function.
Weight of a Decision Tree. We will assign to each decision tree a weight which is computed as
a function of the spectral norms (or, more precisely, the degrees) of its constituent nodes. The weight
w(T) of a decision tree T is defined by induction on the structure of the binary tree T as follows:
• If |T| = 1 then we let w(T) = L(g) where g is the function labeling the (single) node which
comprises the tree.
• If |T| ≥ 3 then T is as depicted in Figure 2 and we let w(T) = deg(g)[w(T_{-1}) + w(T_1)].
Useful Facts. Before we can prove the main theorem we need to develop some facts about the
spectral norm.
Lemma 5.1 L(fχ_s) = L(f) for any function f.
Proof: We have
f̂χ_s(z) = 2^{-n} Σ_x f(x)χ_s(x)χ_z(x) = 2^{-n} Σ_x f(x)χ_{s⊕z}(x) = f̂(s⊕z)
and the lemma follows from the fact that the map z ↦ s⊕z is a permutation of {0,1}^n. ∎

Figure 2: The decision tree T (root labeled g, left subtree T_{-1}, right subtree T_1)
Lemma 5.2 Suppose f, g : {0,1}^n → R. Then
(1) L(f + g) ≤ L(f) + L(g)
(2) L(fg) ≤ L(f)L(g).
Proof: By the linearity of the Fourier transform we have F(f + g)(z) = F(f)(z) + F(g)(z). Apply
the triangle inequality and then sum over all z to get (1).
Let a_z = ĝ(z). Then making use of (1) we have
L(fg) = L(f · Σ_z a_z χ_z) = L(Σ_z a_z fχ_z) ≤ Σ_z L(a_z fχ_z) = Σ_z |a_z| L(fχ_z).
But by Lemma 5.1 this equals
Σ_z |a_z| L(f) = L(f) Σ_z |a_z| = L(f)L(g)
which establishes (2). ∎
Main Theorem. The main theorem is now very simply stated. It says that the spectral norm of a
decision tree is at most its weight.
Theorem 5.3 Let T be a decision tree. Then L(T) ≤ w(T).
Proof: The proof is by induction on the structure of the tree. The base case of T having only one
node is easily verified. Now suppose T has ≥ 3 nodes. Visualize T as depicted in Figure 2. Now
observe that
T(x) = ½[1 - g(x)]T_{-1}(x) + ½[1 + g(x)]T_1(x)
     = ½T_{-1}(x) + ½T_1(x) - ½T_{-1}(x)g(x) + ½T_1(x)g(x).
By Lemma 5.2 we get
L(T) ≤ ½L(T_{-1}) + ½L(T_1) + ½L(T_{-1})L(g) + ½L(T_1)L(g)
     = ½[L(g) + 1]L(T_{-1}) + ½[L(g) + 1]L(T_1)
     = deg(g)[L(T_{-1}) + L(T_1)].
But by induction L(T_{-1}) ≤ w(T_{-1}) and L(T_1) ≤ w(T_1), so the above is ≤ deg(g)[w(T_{-1}) + w(T_1)] =
w(T). ∎
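The inductive definitions of this section translate directly into code. The sketch below (my own representation: a tree is either ('leaf', g) or ('node', g, T_minus, T_plus)) computes w(T) and lets one check the bound L(T) ≤ w(T) on small examples:

```python
from itertools import product

def spectral_norm(f, n):
    # Brute-force L(f)
    pts = list(product((0, 1), repeat=n))
    return sum(abs(sum(f(x) * (-1) ** sum(zi & xi for zi, xi in zip(z, x))
                       for x in pts)) / 2 ** n
               for z in pts)

def tree_eval(tree):
    # A tree is ('leaf', g) or ('node', g, t_minus, t_plus); branch to
    # t_minus when g(x) = -1 and to t_plus when g(x) = +1, as in the model.
    def f(x):
        t = tree
        while t[0] == 'node':
            t = t[2] if t[1](x) == -1 else t[3]
        return t[1](x)
    return f

def weight(tree, n):
    # w(T): L(g) at a leaf, deg(g) * (w(T_-1) + w(T_1)) at an internal
    # node, where deg(g) = (L(g) + 1) / 2.
    if tree[0] == 'leaf':
        return spectral_norm(tree[1], n)
    deg = (spectral_norm(tree[1], n) + 1) / 2
    return deg * (weight(tree[2], n) + weight(tree[3], n))
```

For a three-node parity tree (a parity at the root, constant ±1 leaves) the weight is 2 = ⌈|T|/2⌉, in line with Corollary 5.4 below, and the tree's actual norm is indeed at most that.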
5.2 Some Applications: Decision Trees of Small Norm
Usually, we are interested in decision trees all of whose nodes are from some small basis of functions.
For example, [KM] consider decision trees where the nodes all compute parity functions. In many of
these special cases, the weight of the tree is easy to compute. Theorem 5.3 can then be applied to
obtain a useful upper bound on the spectral norm of the tree.
This technique is quite general, and one can use it to bound the spectral norm of trees with all
kinds of combinations of functions in the nodes. Here I illustrate with just two examples.
Parity Decision Trees. Let us begin by deriving the result of [KM] that the spectral norm of a
parity decision tree is bounded above by the number of nodes in the tree (a parity decision tree is a
decision tree each of whose internal nodes is labeled by a parity function χ_s and each of whose leaves
is labeled +1 or -1).
Corollary 5.4 [KM] Let T be a parity decision tree. Then L(T) ≤ ⌈|T|/2⌉.
Proof: By the main theorem it suffices to show that the weight of a parity decision tree T is ≤ ⌈|T|/2⌉.
The important observation for this is that deg(χ_s) = 1. The fact that w(T) ≤ ⌈|T|/2⌉ is now easily
established by induction using the definition of w. ∎
Characterization of Degree 1 Functions. We note that the only fact about the parity functions
that we used in the above is that deg(χ_s) = 1. Thus the theorem extends to decision trees each of
whose nodes is labeled by a function of degree 1. As the following characterization of degree 1 functions
shows, however, these decision trees are actually identical to parity decision trees.
Proposition 5.5 Suppose g is boolean. Then deg(g) = 1 if and only if L(g) = 1 if and only if g = ±χ_s
for some s.
Proof: It is clear that deg(g) = 1 if and only if L(g) = 1, and it is also clear that L(g) = 1 if g = ±χ_s.
To complete the proof it suffices to show that if L(g) = 1 then g = ±χ_s for some s. So suppose
L(g) = 1. Then we have
1 = |g(x)| = |Σ_z ĝ(z)χ_z(x)| ≤ Σ_z |ĝ(z)| = L(g) = 1
and thus |Σ_z ĝ(z)χ_z(x)| = Σ_z |ĝ(z)χ_z(x)|. It follows that for each x either ĝ(z)χ_z(x) ≥ 0 for all z or
ĝ(z)χ_z(x) ≤ 0 for all z. Moreover the former happens when g(x) = 1 and the latter when g(x) = -1,
because g(x) = Σ_z ĝ(z)χ_z(x). So fixing any z such that ĝ(z) ≠ 0 (at least one such must exist) we
can conclude that sign(ĝ(z)) = χ_z(x) if g(x) = 1 and sign(ĝ(z)) = -χ_z(x) if g(x) = -1. But this is
true for all x, and thus g is either χ_z or -χ_z. ∎
So the parity functions and their negations are exactly the boolean functions of minimal degree.
Computing the Weight. Before proceeding to more applications let us derive a more useful formulation
of the weight function. Namely,
w(T) = Σ_{leaves ℓ} L(ℓ) · Π_{internal ancestors i of ℓ} deg(i).
That is, the weight is the sum, over all root to leaf paths, of the norm of the leaf times the product
of the degrees of the internal nodes along the path (we are identifying a node with the function labeling
it, so that deg(i) means the degree of the function labeling node i).
AND and OR. We looked at the spectral norm of the AND and OR functions in §4, and by
Proposition 4.1 we know that deg(AND_z), deg(OR_z) ≤ 2. So these functions are just a step higher
than parity in the degree hierarchy, and the natural choice of an addition to the basis to get a richer
class of decision trees.
Corollary 5.6 Let T be a decision tree over the basis {χ_s, AND_s, OR_s : s ∈ {0,1}^n} with the property
that any root to leaf path has at most O(log n) AND_s or OR_s nodes. Then L(T) ≤ |T| n^{O(1)}.
Proof: The weight is easily bounded using the above expression. ∎
I don't have space for more examples, but these should have sufficed to illustrate quite well how the
main theorem can be applied.
6 Lower Bounds: Functions of Large Norm
I now turn to lower bounds on the spectral norm. I'll first (briefly) consider arbitrary real valued
functions and then move on to the real case of interest: boolean functions.
6.1 General Functions
Let's begin with the following observation.
Proposition 6.1 L(f) ≥ max_x |f(x)| for any function f : {0,1}^n → R.
Proof: For any x it is the case that
L(f) = Σ_z |f̂(z)| = Σ_z |f̂(z)χ_z(x)| ≥ |Σ_z f̂(z)χ_z(x)| = |f(x)|. ∎
This means that for a function to have polynomially bounded spectral norm, its output must be at
most 0(log n) bits long. Many simple and interesting functions are thus ruled out. Addition of two n
bit numbers is one such.
One quickly realizes however, that the above is not really a restriction. In most applications
(learning is an example) the output can be viewed bit by bit. That is, we view the function as a
concatenation of boolean functions, and consider the spectral norm of each of these boolean functions
separately. So we should concentrate on boolean functions.
6.2 Boolean Functions
It is probably not surprising that
Theorem 6.2 Most boolean functions have large spectral norm.
We can view this as a corollary of the fact that functions of small norm are efficiently learnable
(Theorem 3.2). A direct proof is not as obvious as it might seem (and I haven't seen one).
Can we construct a specific example of a boolean function of large norm? The answer is yes. The
function is the "inner product mod 2" function, defined for even n by
I_n(x) = Π_{i=1}^{n/2} (-1)^{x_i x_{n/2+i}}.
Theorem 6.3 [MS],[Br] Let n be even. Then L(I_n) = 2^{n/2}.
Proof: This proof uses the matrix representation of Fourier transforms that we introduced in §2.
We claim that [I_n] is an eigenvector of H_n with eigenvalue 2^{n/2}. Proposition 2.2 then implies that
[Î_n] = 2^{-n/2}[I_n], and the result follows. The fact that [I_n] is an eigenvector of H_n with eigenvalue 2^{n/2}
is easily established by induction on k = n/2. ∎
This proof is from [Br], who however cites [MS] for the result.
We note that by Proposition 2.1 the bound of Theorem 6.3 is tight.
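Theorem 6.3 is easy to confirm by brute force for small even n (helper names mine):

```python
from itertools import product

def spectral_norm(f, n):
    # Brute-force L(f)
    pts = list(product((0, 1), repeat=n))
    return sum(abs(sum(f(x) * (-1) ** sum(zi & xi for zi, xi in zip(z, x))
                       for x in pts)) / 2 ** n
               for z in pts)

def inner_product_fn(n):
    # I_n(x) = prod_{i=1}^{n/2} (-1)^{x_i * x_{n/2+i}}
    k = n // 2
    return lambda x: (-1) ** sum(x[i] & x[k + i] for i in range(k))
```

For n = 2 and n = 4 the computed norms are 2 and 4, i.e. 2^{n/2}, meeting the Proposition 2.1 ceiling exactly.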
A few other simple examples of functions of exponential spectral norm are known [Br],[BS]. The
proofs in all cases are based on inductive ideas like the ones above (and indeed the functions are
constructed to have enough inductive structure to make these proofs go through). I'll now consider a
somewhat different kind of boolean function: majority.
6.3 An Exponential Lower Bound for Majority
The majority function M_n : {0,1}^n → {-1,+1} is defined by
M_n(x) = 1 if Σ_{i=1}^n x_i ≥ n/2, and -1 otherwise.
What's Known. Siu and Bruck [SB] show that the spectral norm of M_n is not polynomially bounded.
Their proof relies on Theorem 3.4 and results from [Br] and [HMPST]. The results from [Br] and
[HMPST] are first used to show that M_n is not approximable (to within o(n^{-1}) error) by any sparse
Fourier series. That L(M_n) is not polynomially bounded then follows from Theorem 3.4.
Assuming this was the best known ([SB] did not indicate otherwise) I proved the exponential lower
bound which appears as Theorem 6.4 below. Later I read [BOH] in which, in a different context,
appear equations for the complete spectrum of the majority function. A lower bound at least as good
as Theorem 6.4 is easily derived from this.
They attribute the spectrum to [Ka]. I was however unable to find [Ka] anywhere, and thus do
not know the proof of the result stated in [BOH].
I should note, though, that [SB] and [BOH] both consider only the case of n being odd (they are interested
in majority as a linear threshold function, and the case of n even is inconvenient because Σ_i x_i could
be 0). My proof applies (only) to n even.
Before going on it is also worth noting that techniques along the lines of those used in the proof of
Theorem 6.3 would not appear to be useful here because the majority function has no clear inductive
structure.
The Lower Bound. The following exponential lower bound is within a polynomial factor of optimal:
Theorem 6.4 Let n ≥ 1 be a multiple of 4. Then
L(M_n) ≥ (4e^{-1/n} / (√2 πn)) · 2^{n/2}.
Let's proceed to the proof.
The attack is pretty direct. What I will do is compute explicitly the "middle" Fourier coefficients
(that is, those corresponding to the parity over a size n/2 subset of the variables), and get a bound
based on these alone.
A Combinatorial Identity. I will need the fact that
Σ_{i=0}^{2k} (-1)^i (2k choose i)² = (-1)^k (2k choose k).
This is probably known, although I haven't been able to find a reference. For completeness, I provide
in Appendix A a simple proof via generating functions.
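The identity is easy to spot-check numerically; it can also be seen by comparing the coefficient of t^{2k} on both sides of (1-t)^{2k}(1+t)^{2k} = (1-t²)^{2k}. A one-line check (names mine):

```python
from math import comb

def alternating_square_sum(k):
    # Left-hand side of the identity: sum_{i=0}^{2k} (-1)^i * C(2k, i)^2
    return sum((-1) ** i * comb(2 * k, i) ** 2 for i in range(2 * k + 1))
```

For every small k this agrees with (-1)^k (2k choose k).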
Computing the Middle Fourier Coefficients. The following lemma gives explicit values for
the middle Fourier coefficients of the majority function.
Lemma 6.5 Let z ∈ {0,1}^{4k} have exactly 2k ones. Then
⟨M_{4k}, χ_z⟩ = 2^{-4k} (2k choose k) (-1)^k.
Proof: We begin by observing that the value of the Fourier coefficient ⟨M_{4k}, χ_z⟩ depends only on the
number of ones in z. So without loss of generality we can set z = 1^{2k}0^{2k}. Now let A = 2^{4k}⟨M_{4k}, χ_z⟩.
We claim that

A = Σ_{i=0}^{2k} (2k choose i)² (-1)^i,    (1)

whence the result follows by our combinatorial identity.
To establish Equation (1) we begin by noting that M_{4k}(x)χ_z(x) = (-1)^i M_{4k}(1^i 0^{2k-i} 1^j 0^{2k-j}),
where i is the number of ones in the left half of x and j is the number of ones in the right half of x.
It follows that

A = Σ_{i=0}^{2k} Σ_{j=0}^{2k} (2k choose i)(2k choose j)(-1)^i M_{4k}(1^i 0^{2k-i} 1^j 0^{2k-j})
  = Σ_{i=0}^{2k} (2k choose i)(-1)^i [ Σ_{j=2k-i}^{2k} (2k choose j) - Σ_{j=0}^{2k-i-1} (2k choose j) ].

The i = k term of the above sum equals (2k choose k)²(-1)^k. Let B be the remaining part of the sum.
Noting that (2k choose i)(-1)^i = (2k choose 2k-i)(-1)^{2k-i}, we can pair the term at i with the term
at 2k-i and rewrite B as

B = Σ_{i=0}^{k-1} (2k choose i)(-1)^i [ Σ_{j=2k-i}^{2k} (2k choose j) - Σ_{j=0}^{2k-i-1} (2k choose j)
                                        + Σ_{j=i}^{2k} (2k choose j) - Σ_{j=0}^{i-1} (2k choose j) ].

Using the fact that (2k choose j) = (2k choose 2k-j), the term in square brackets simplifies to

[ Σ_{j=0}^{i} (2k choose j) - Σ_{j=0}^{i-1} (2k choose j) ] + [ Σ_{j=0}^{2k-i} (2k choose j) - Σ_{j=0}^{2k-i-1} (2k choose j) ]
  = (2k choose i) + (2k choose 2k-i) = 2(2k choose i),

and putting all this together (using the same symmetry to fold the sum back to the full range) we get

A = (2k choose k)²(-1)^k + 2 Σ_{i=0}^{k-1} (2k choose i)²(-1)^i = Σ_{i=0}^{2k} (2k choose i)²(-1)^i

as desired. ∎
Proof of Theorem 6.4. The proof of Theorem 6.4 follows easily now. First let us recall the bound

(2n choose n) ≥ (e^{-1/6n} / √(πn)) · 2^{2n}    (2)

(which is obtained using Stirling's formula). Now let k = n/4. Then L(M_{4k}) = Σ_z |⟨M_{4k}, χ_z⟩| ≥
Σ_{|z|=2k} |⟨M_{4k}, χ_z⟩| = (4k choose 2k) · 2^{-4k} (2k choose k), and this by Lemma 6.5 and Equation (2) is at least

(e^{-1/12k} / √(2πk)) · (e^{-1/6k} / √(πk)) · 2^{2k} = (4e^{-1/n} / (√2 πn)) · 2^{n/2}

as desired. ∎
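Lemma 6.5 can be checked by brute force for small k. The sketch below (names mine) assumes the convention, used throughout this subsection, that a tie Σx_i = n/2 counts as majority 1:

```python
from itertools import product
from math import comb

def majority(n):
    # M_n(x) = 1 if at least n/2 of the bits are 1, else -1
    # (ties count as majority, the convention assumed here)
    return lambda x: 1 if sum(x) >= n / 2 else -1

def middle_coefficient(k):
    # <M_4k, chi_z> for z = 1^{2k} 0^{2k}, computed straight from the
    # definition: 2^{-4k} * sum_x M_4k(x) * (-1)^{x_1 + ... + x_2k}
    n = 4 * k
    m = majority(n)
    total = sum(m(x) * (-1) ** sum(x[:2 * k])
                for x in product((0, 1), repeat=n))
    return total / 2 ** n
```

For k = 1 this yields -2/16 and for k = 2 it yields 6/256, matching 2^{-4k} (2k choose k) (-1)^k.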
7 Generalization: q-norms
I now turn to a generalization of the basic Fourier series and the corresponding spectral norms (q-norms).
Results about the q-norm being much scarcer in the literature (it is a recent invention of Furst,
Jackson and Smith [FJS]), I will spend more time here on developing techniques and applications. In
particular, I present a new application.
The motivation, once again, comes from learning.
One of the drawbacks of the [KM] learning algorithm is that the quality of the hypothesis is
measured by its proximity to the target under one particular fixed distribution: the uniform one. Can
we move any closer to some kind of distribution free learning?
History, luckily, provides a parallel. Linial, Mansour and Nisan [LMN] had used Fourier analysis
to develop an algorithm to learn AC⁰ functions under the uniform distribution. Furst, Jackson and
Smith extended this result to what they called mutually independent distributions. In order to do this
they developed a more general Fourier transform in which the basis is defined in terms of the mutually
independent distribution q.
The tools and framework introduced by [FJS] provide an avenue to similarly generalize the [KM]
learning algorithm. Measuring the quality of a hypothesis by its proximity to the target under q, we
will be able to learn in time polynomial in the q-norm.
Here I will present this generalization of [KM] and then try to say something about q-norms and
their relation to the usual norm.
7.1 The q-Framework
I must begin by developing the machinery. Most of the work, in fact, is here: once things have been
properly set up the lemmas of [KM] can be generalized in a pretty straightforward manner.
Mutually Independent Distributions. A probability distribution $q : \{0,1\}^n \to [0,1]$ is mutually
independent if the random variables $x_1, \ldots, x_n$ are independent.
We will be assuming that $0 < \Pr[x_i = 1] < 1$ for each $i$. We say that $q$ is polynomially bounded if
$$\max_{1\le i\le n}\left[\frac{1}{\Pr[x_i=1]} + \frac{1}{\Pr[x_i=0]}\right]$$
is polynomially bounded (as a function of $n$).
q-Fourier Series. Let $q$ be mutually independent. We define the inner product of $f, g : \{0,1\}^n \to \mathbf{R}$
by
$$\langle f, g\rangle_q = \sum_{x\in\{0,1\}^n} q(x)f(x)g(x).$$
Note that $\langle f,g\rangle_q = \mathrm{E}[fg]$ where the expectation is over $q$.
The norm associated to this inner product is $\|f\|_q = \sqrt{\langle f,f\rangle_q}$.
Let $\mu_i = \Pr[x_i = 1]$. Note that $\mu_i$ is the mean of the random variable $x_i$. Let the standard
deviation $\sqrt{\mu_i(1-\mu_i)}$ of $x_i$ be denoted by $\sigma_i$.
For $z, x \in \{0,1\}^n$ we define
$$\phi(z,x) = \prod_{i:\, z_i = 1} \frac{x_i - \mu_i}{\sigma_i}.$$
We call $\phi$ the basis associated to $q$ because of the following
Proposition 7.1 [Ba],[FJS] Let $q : \{0,1\}^n \to [0,1]$ be mutually independent and let $\phi$ be as defined
above. Then $\{\phi(z,\cdot)\}_{z\in\{0,1\}^n}$ is an orthonormal basis for the vector space of real valued functions on
$\{0,1\}^n$ (the orthonormality is with respect to the inner product $\langle\cdot,\cdot\rangle_q$).
Note that $\phi(z,\cdot) = \chi_z$ when $q$ is the uniform distribution.
To get a feel for these definitions, let's check the orthonormality. For $a, b \in \{0,1\}^n$ (identified with
subsets of $\{1,\ldots,n\}$) we have
$$\langle \phi(a,\cdot), \phi(b,\cdot)\rangle_q = \mathrm{E}\left[\prod_{i\in a}\frac{x_i-\mu_i}{\sigma_i}\prod_{j\in b}\frac{x_j-\mu_j}{\sigma_j}\right],$$
and the mutual independence implies that this equals
$$\prod_{i\in a\cap b} \mathrm{E}\left[\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right]\cdot\prod_{i\in a\,\triangle\, b} \mathrm{E}\left[\frac{x_i-\mu_i}{\sigma_i}\right].$$
Now each term of the first product is 1 while each term of the second is 0, and thus the above expression
is 1 if $a = b$ and 0 otherwise.
The Fourier series of $f : \{0,1\}^n \to \mathbf{R}$ is $\sum_{z\in\{0,1\}^n} \hat f(z)\phi(z,x)$ where $\hat f(z) = \langle f, \phi(z,\cdot)\rangle_q$. The
q-spectral norm (or just q-norm) is $L_q(f) = \sum_{z\in\{0,1\}^n} |\hat f(z)|$. Parseval's identity as usual says that
$\sum_{z\in\{0,1\}^n} \hat f(z)^2 = \|f\|_q^2$.
I'm abusing notation a little by denoting by $\hat f(z)$ both the Fourier coefficients under $q$ and the
(usual) Fourier coefficients with respect to $\{\chi_z\}$. The context should usually make it clear which I
mean, and if both are around I will disambiguate.
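Because everything here is finite, Proposition 7.1 and Parseval's identity can be verified exhaustively for tiny $n$. Below is a small sketch (the particular $q$ is an arbitrary mutually independent distribution drawn at random; all names are illustrative):

```python
import itertools
import random
from math import isclose, prod, sqrt

n = 3
random.seed(0)
mu = [random.uniform(0.1, 0.9) for _ in range(n)]   # mu_i = Pr[x_i = 1]
sigma = [sqrt(m * (1 - m)) for m in mu]
pts = list(itertools.product((0, 1), repeat=n))

def q(x):
    # mutually independent distribution on {0,1}^n
    return prod(m if xi else 1 - m for xi, m in zip(x, mu))

def phi(z, x):
    # basis function: product over {i : z_i = 1} of (x_i - mu_i)/sigma_i
    return prod((xi - m) / s for zi, xi, m, s in zip(z, x, mu, sigma) if zi)

def inner(f, g):
    # <f, g>_q = sum_x q(x) f(x) g(x)
    return sum(q(x) * f(x) * g(x) for x in pts)

# Orthonormality of {phi(z, .)} with respect to <.,.>_q (Proposition 7.1).
for a in pts:
    for b in pts:
        ip = inner(lambda x: phi(a, x), lambda x: phi(b, x))
        assert isclose(ip, 1.0 if a == b else 0.0, abs_tol=1e-9)

# Parseval: sum_z fhat(z)^2 = ||f||_q^2 for a random f.
f = {x: random.uniform(-1, 1) for x in pts}
fhat = {z: inner(f.get, lambda x: phi(z, x)) for z in pts}
assert isclose(sum(c * c for c in fhat.values()), inner(f.get, f.get), abs_tol=1e-9)
```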
The Shifted Distribution and Basis. With minor changes the above is the framework of [FJS].
I now have to go a little further.
Let $q$ be mutually independent and $\phi$ the associated basis. Suppose $1 \le k \le n$.
For $y \in \{0,1\}^k$ and $0 \le j \le n-k$ we define $q_j(y) = \Pr[x_{j+1} = y_1, \ldots, x_{j+k} = y_k]$. Observe that $q_j$
is a mutually independent distribution on $\{0,1\}^k$.
For notational convenience we identify $q_0$ with $q$.
For $x, z \in \{0,1\}^k$ and $0 \le j \le n-k$ we define
$$\phi_j(z,x) = \prod_{i:\, z_i = 1} \frac{x_i - \mu_{j+i}}{\sigma_{j+i}}.$$
We also define $\phi_j(\lambda,\lambda) = 1$ for all $j = 0, \ldots, n$.
For notational convenience we identify $\phi_0$ with $\phi$.
The fact that $q_j$ is a mutually independent distribution on $\{0,1\}^k$ implies, by Proposition 7.1, that
$\{\phi_j(z,\cdot)\}_{z\in\{0,1\}^k}$ is an orthonormal basis for the vector space of real valued functions on $\{0,1\}^k$ (the
orthonormality is with respect to the inner product $\langle\cdot,\cdot\rangle_{q_j}$).
We'll denote by $\mathrm{E}_j$ the expectation over $q_j$.
A Switching Lemma. One very useful property of the $\phi$-basis is that
$$\sum_{z\in\{0,1\}^k}\phi_j(z,a)\,\phi_j(z,b) \;=\; \begin{cases}1/q_j(a) & \text{if } a = b\\ 0 & \text{otherwise}\end{cases}$$
for any $a, b \in \{0,1\}^k$ and any $j = 0, \ldots, n-k$.
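What gets used in the proof of Lemma 7.5 is this dual form of orthonormality: summing over basis indices rather than points, $\sum_{z\in\{0,1\}^k}\phi_j(z,a)\phi_j(z,b)$ equals $1/q_j(a)$ if $a = b$ and 0 otherwise. A small exhaustive check (the per-coordinate factorization in the comment is the reason the identity holds):

```python
import itertools
import random
from math import isclose, prod, sqrt

k = 3
random.seed(1)
mu = [random.uniform(0.1, 0.9) for _ in range(k)]
sigma = [sqrt(m * (1 - m)) for m in mu]
pts = list(itertools.product((0, 1), repeat=k))

def q(x):
    return prod(m if xi else 1 - m for xi, m in zip(x, mu))

def phi(z, x):
    return prod((xi - m) / s for zi, xi, m, s in zip(z, x, mu, sigma) if zi)

# sum over basis indices z of phi(z, a) phi(z, b) factors coordinatewise:
# coordinate i contributes 1 + (a_i - mu_i)(b_i - mu_i)/sigma_i^2, which is
# 1/Pr[x_i = a_i] when a_i = b_i and 0 when a_i != b_i.
for a in pts:
    for b in pts:
        s = sum(phi(z, a) * phi(z, b) for z in pts)
        target = 1 / q(a) if a == b else 0.0
        assert isclose(s, target, abs_tol=1e-9)
```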
7.2 Efficient Learnability and the q-Norm
Fix a mutually independent distribution $q$. The only change in the learning model is that the probability
in the definition of the error $\mathrm{Err}_h(f) = \Pr[f(x) \ne h(x)]$ is now taken over $q$ (rather than over $x$
chosen uniformly at random).
The Result. Let $\mathcal{C}_B(q) = \bigcup_{n\ge1}\mathcal{C}_B^n(q)$ where $\mathcal{C}_B^n(q)$ is the class of functions from $\{0,1\}^n$ to $\{-1,+1\}$
whose spectral q-norm is bounded above by $B(n)$.
Theorem 7.2 Let $B : \mathbf{N} \to \mathbf{N}$. Let $q$ be a polynomially bounded, mutually independent distribution.
Then the concept class $\mathcal{C}_B(q)$ is learnable in time polynomial in $B(n), n, \epsilon^{-1}$ and $\log \delta^{-1}$.
Note that when q is the uniform distribution this reduces to the [KM] result.
The algorithm is shown in Figure 3. Again, one can check that when q is the uniform distribution
it is just the algorithm of Figure 1.
To show that it works, I now generalize the three basic lemmas of [KM]. I'll state and prove the
lemmas without elaborating overly; the discussion in §3.1 should suffice to see why these lemmas imply
the correctness of the algorithm, and for even more detail the reader is referred to [KM].
The Lemmas. Fix a mutually independent distribution $q$ and its associated basis $\phi$. Let $f : \{0,1\}^n \to
\mathbf{R}$ (note we are not requiring $f$ to be boolean). Suppose $0 \le k \le n-1$ and $a \in \{0,1\}^k$. Define
$f_a : \{0,1\}^{n-k} \to \mathbf{R}$ by $f_a(x) = \sum_{\beta\in\{0,1\}^{n-k}} \hat f(a\beta)\,\phi_k(\beta,x)$.
The first lemma shows that it suffices to find the terms of sufficiently large Fourier coefficient. The
proof remains identical to the one in [KM].
S ← Coef($1^n$, $\lambda$, $\epsilon B(n)^{-1}$; $f$)
$h(\cdot)$ ← sign$\left(\sum_{z\in S} \hat f(z)\phi(z,\cdot)\right)$
return $h$

Coef($1^n$, $a$, $\theta$; $f$)
  { Returns a superset of $\{\, a\beta : |\hat f(a\beta)| \ge \theta \,\}$ }
  if $a \in \{0,1\}^n$ then return $\{a\}$
  else let $k$ be the length of $a$
       if $\mathrm{E}_k[f_a^2] \ge \theta^2$ then return Coef($1^n$, $a0$, $\theta$; $f$) $\cup$ Coef($1^n$, $a1$, $\theta$; $f$)
       else return $\emptyset$

Figure 3: The Generalized Learning Algorithm $A_{B,q}$
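For intuition, here is a minimal executable sketch of the Coef recursion on a toy target. One simplification is deliberate: since the sketch computes exact coefficients, it replaces the sampled test $\mathrm{E}_k[f_a^2] \ge \theta^2$ by its exact value $\sum_\beta \hat f(a\beta)^2$ (Parseval); the real algorithm estimates this quantity by sampling via Lemma 7.5. All names are illustrative:

```python
import itertools
import random
from math import prod, sqrt

n = 4
random.seed(2)
mu = [random.uniform(0.2, 0.8) for _ in range(n)]
sigma = [sqrt(m * (1 - m)) for m in mu]
pts = list(itertools.product((0, 1), repeat=n))

def q(x):
    return prod(m if xi else 1 - m for xi, m in zip(x, mu))

def phi(z, x):
    return prod((xi - m) / s for zi, xi, m, s in zip(z, x, mu, sigma) if zi)

f = {x: random.choice((-1, 1)) for x in pts}                      # toy +-1 target
fhat = {z: sum(q(x) * f[x] * phi(z, x) for x in pts) for z in pts}

def coef(a, theta):
    """Figure 3's Coef: returns a superset of {ab : |fhat(ab)| >= theta}."""
    if len(a) == n:
        return {a}
    # Exact stand-in for the sampled test "E_k[f_a^2] >= theta^2":
    # by Parseval, E_k[f_a^2] = sum_b fhat(ab)^2.
    e = sum(fhat[a + b] ** 2
            for b in itertools.product((0, 1), repeat=n - len(a)))
    if e >= theta ** 2:
        return coef(a + (0,), theta) | coef(a + (1,), theta)
    return set()

S = coef((), 0.25)
assert {z for z in pts if abs(fhat[z]) >= 0.25} <= S   # no large coefficient is missed
```

Lemma 7.4 is what keeps the recursion small: only a bounded fraction of prefixes can pass the energy test at each level.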
Lemma 7.3 Let $\epsilon > 0$ and suppose $S \supseteq \{z : |\hat f(z)| \ge \epsilon L_q(f)^{-1}\}$. Let $g(x) = \sum_{z\in S} \hat f(z)\phi(z,x)$.
Then $\mathrm{E}[(f-g)^2] \le \epsilon$. Moreover if $f$ is boolean then $\Pr[f(x) \ne \mathrm{sign}(g(x))] \le \epsilon$.
Proof:
$$\mathrm{E}[(f-g)^2] = \sum_{z\in\{0,1\}^n}[\hat f(z) - \hat g(z)]^2 = \sum_{z\notin S}\hat f(z)^2,$$
and the last of these is at most
$$\max_{z\notin S}|\hat f(z)|\cdot\sum_{z\in\{0,1\}^n}|\hat f(z)| \;\le\; \epsilon L_q(f)^{-1}\cdot L_q(f) = \epsilon.$$
For boolean $f$ we note that $\Pr[f(x) \ne \mathrm{sign}(g(x))] \le \Pr[|f(x) - g(x)| \ge 1] \le \mathrm{E}[(f-g)^2]$. ■
The next lemma is used to bound the search time. I state it slightly more generally than [KM] so that
the case of non-boolean functions is covered.
Lemma 7.4 At most a $\|f\|_q^2\,\theta^{-2}$ fraction of the functions $f_a$ ($a \in \{0,1\}^k$) have $\mathrm{E}_k[f_a^2] \ge \theta^2$.
Proof: We observe that
$$\sum_a \mathrm{E}_k[f_a^2] = \sum_a \langle f_a, f_a\rangle_{q_k} = \sum_a \left\langle \sum_\beta \hat f(a\beta)\phi_k(\beta,\cdot),\ \sum_{\beta'} \hat f(a\beta')\phi_k(\beta',\cdot)\right\rangle_{q_k},$$
which by the bilinearity of the inner product and the orthonormality of the basis is
$$\sum_{a\beta}\hat f(a\beta)^2 = \|f\|_q^2.$$
The lemma follows. ■
Note that $\|f\|_q \le L_q(f)$ and hence if $L_q(f)$ is polynomially bounded then so is $\|f\|_q$.
Finally we have to show how to approximate $\mathrm{E}_k[f_a^2]$. It can be done by sampling provided we can
approximate $f_a$ itself. By equating $f_a$ with an expectation the following lemma provides the avenue
to obtain the necessary approximation to it.
Lemma 7.5 $f_a(x) = \mathrm{E}[f(\cdot\, x)\,\phi(a,\cdot)]$.
Proof: Making liberal use of the identities in §7.1 we have
$$\begin{aligned}
f_a(x) &= \sum_{\beta}\hat f(a\beta)\,\phi_k(\beta,x)\\
&= \sum_{\beta}\sum_{yz} q_0(y)\,q_k(z)\,f(yz)\,\phi_0(a,y)\,\phi_k(\beta,z)\,\phi_k(\beta,x)\\
&= \sum_{yz} f(yz)\,\phi_0(a,y)\,q_0(y)\cdot q_k(z)\sum_{\beta}\phi_k(\beta,z)\,\phi_k(\beta,x)\\
&= \sum_{y} f(yx)\,\phi_0(a,y)\,q_0(y)\\
&= \mathrm{E}[f(\cdot\, x)\,\phi(a,\cdot)],
\end{aligned}$$
where the fourth equality is the switching lemma. ■
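For small $n$ Lemma 7.5 can be confirmed exhaustively. A sketch with $n = 4$ and $k = 2$ (the expectation on the right-hand side is over the first $k$ coordinates drawn according to $q_0$):

```python
import itertools
import random
from math import isclose, prod, sqrt

n, k = 4, 2
random.seed(3)
mu = [random.uniform(0.2, 0.8) for _ in range(n)]
sigma = [sqrt(m * (1 - m)) for m in mu]

def phi(z, x, off=0):
    # phi_off: the shifted basis over coordinates off+1, ..., off+len(z)
    return prod((xi - mu[off + i]) / sigma[off + i]
                for i, (zi, xi) in enumerate(zip(z, x)) if zi)

def q(x, off=0):
    return prod(m if xi else 1 - m for xi, m in zip(x, mu[off:]))

pts_n = list(itertools.product((0, 1), repeat=n))
pts_k = list(itertools.product((0, 1), repeat=k))
pts_rest = list(itertools.product((0, 1), repeat=n - k))

f = {x: random.uniform(-1, 1) for x in pts_n}
fhat = {z: sum(q(x) * f[x] * phi(z, x) for x in pts_n) for z in pts_n}

for a in pts_k:
    for x in pts_rest:
        f_a = sum(fhat[a + b] * phi(b, x, off=k) for b in pts_rest)   # definition of f_a
        expect = sum(q(y) * f[y + x] * phi(a, y) for y in pts_k)      # E[f(. x) phi(a, .)]
        assert isclose(f_a, expect, abs_tol=1e-9)
```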
Learning Non-Boolean Functions. The restriction that the target function be boolean is not
really necessary. For the general case we change the definition of the error to $\mathrm{Err}_h(f) = \mathrm{E}[(f-h)^2]$
(the expectation, as usual, being over $q$). We also require the target functions to have polynomially
bounded output length under some encoding of the output into bits. Finally the algorithm of Figure 3
is modified to output the function $g(\cdot) = \sum_{z\in S}\hat f(z)\phi(z,\cdot)$ rather than just its sign.
7.3 The q-norm of Decision Trees
The results of §5 can also be generalized by appropriately extending the definition of the degree.
Namely, let $q$ be mutually independent and $g : \{0,1\}^n \to \mathbf{R}$. Then the degree of $g$ with respect to
$q$, denoted $\deg_q(g)$, is defined to be the least positive real number $d$ with the property that $L_q(fg) \le
(2d-1)L_q(g)$ for all $f : \{0,1\}^n \to \{-1,+1\}$.
It should be noted that when $q$ is the uniform distribution this definition of the degree coincides
with the one in §5. However I suspect that in general it is not true that $\deg_q(g) = \frac{1}{2}[L_q(g)+1]$.
The weight $w_q(T)$ of a decision tree $T$ is then defined by induction as before, using the new
definition of the degree, and the proof of the main theorem can be extended to show that $L_q(T) \le w_q(T)$
for any decision tree $T$. I omit the details.
7.4 L q versus L
Results such as the above motivate us to understand the relationship between the usual spectral norm
and the q-spectral norm. In particular, can the change of basis and distribution drastically reduce the
spectral norm?
Proposition 7.6 There is a function of exponential spectral norm which has polynomial spectral q-
norm with respect to some polynomially bounded, mutually independent $q$.
Proof: Let $q$ be defined by $\Pr[x_i = 1] = \frac{3}{4}$ for each $i = 1, \ldots, n$. Let $\phi$ be the associated basis. Then
$L(\phi(z,\cdot)) \ge 3^{|z|/2}$ (this follows from Proposition 6.1 since $|\phi(z,0^n)| = 3^{|z|/2}$). So for $z$ having, say, a
constant fraction of ones, the (usual) spectral norm of $\phi(z,\cdot)$ is exponential.
The q-norm of $\phi(z,\cdot)$, on the other hand, is of course just 1 since $\phi(z,\cdot)$ is a basis function. ■
So $\phi(z,\cdot)$ is not learnable by the original algorithm but is learned by the generalized one.
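The gap in the proof is in fact exact: taking $\mu_i = 3/4$, each single-coordinate factor $(x_i - \mu_i)/\sigma_i$ has usual spectral norm $\sqrt3$, and spectral norms multiply over products of functions on disjoint variables, so $L(\phi(z,\cdot)) = 3^{|z|/2}$ while $L_q(\phi(z,\cdot)) = 1$. A small brute-force check of the first claim:

```python
import itertools
from math import isclose, sqrt

mu, sig = 3 / 4, sqrt(3) / 4   # mean and standard deviation when Pr[x_i = 1] = 3/4

def usual_L(f, m):
    # usual spectral norm: sum_z |<f, chi_z>| with chi_z(x) = (-1)^{x . z}
    pts = list(itertools.product((0, 1), repeat=m))
    total = 0.0
    for z in pts:
        c = sum(f(x) * (-1) ** sum(xi for xi, zi in zip(x, z) if zi)
                for x in pts) / 2 ** m
        total += abs(c)
    return total

for m in range(1, 6):
    def basis(x):
        # phi(z, .) restricted to its support, for z = 1^m
        p = 1.0
        for xi in x:
            p *= (xi - mu) / sig
        return p
    assert isclose(usual_L(basis, m), 3 ** (m / 2), rel_tol=1e-9)
```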
Problem. Find an example of a boolean function $f$ and a polynomially bounded, mutually independent
$q$ such that $L(f)$ is not polynomially bounded but $L_q(f)$ is.
I don't know how hard this is; I haven't had time to try.
Acknowledgements
Thanks to Ron Rivest for suggesting (and writing) a lisp routine for evaluating $\sum_{i=0}^{2k}\binom{2k}{i}^2(-1)^i$ — I
wouldn't have got the answer without it!
References
[Ba] Bahadur, R., "A Representation of the Joint Distribution of Responses to n Dichotomous
Items," Studies in Item Analysis and Prediction (edited by H. Solomon), Stanford University
Press (1961).
[BOH] Brandman, Y., A. Orlitsky and J. Hennessy, "A Spectral Lower Bound Technique for
the Size of Decision Trees and Two-Level AND/OR Circuits," IEEE Transactions on
Computers 39(2), 282-287 (February 1990).
[Br] Bruck, J., "Harmonic Analysis of Polynomial Threshold Functions," SIAM J. Discrete
Math. 3(2), 168-177 (1990).
[BS] Bruck, J. and R. Smolensky, "Polynomial Threshold Functions, AC⁰ Functions, and Spec-
tral Norms," Proceedings of the 31st Annual IEEE Symposium on the Foundations of
Computer Science, IEEE (1990).
[FJS] Furst, M., J. Jackson and S. Smith, "Learning AC⁰ Functions Sampled under Mutually
Independent Distributions," Manuscript (October 1990).
[HMPST] Hajnal, A., W. Maass, P. Pudlak, M. Szegedy and G. Turan, "Threshold Circuits of
Bounded Depth," Proceedings of the 28th Annual IEEE Symposium on the Foundations
of Computer Science, IEEE (1987).
[Ka] Karpovsky, M., "Finite Orthogonal Series in the Design of Digital Devices," Wiley (1976).
[KM] Kushilevitz, E. and Y. Mansour, "Learning Decision Trees using the Fourier Spectrum,"
Manuscript (November 1990).
[LMN] Linial, N., Y. Mansour and N. Nisan, "Constant Depth Circuits, Fourier Transform, and
Learnability," Manuscript (January 1991). Preliminary version in Proceedings of the 30th
Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1989).
[MS] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting Codes," North-Holland
(1978).
[SB] Siu, K. and J. Bruck, "On the Power of Threshold Circuits with Small Weights,"
Manuscript.
[Va] Valiant, L., "A Theory of the Learnable," Communications of the ACM 27(11), 1134-1142
(1984).
A Appendix: Proof of the Combinatorial Identity
Here I give a proof of the combinatorial identity used to derive the lower bound on the spectral norm
of the majority function. Let's begin by recalling the identity:
$$\sum_{i=0}^{2k}\binom{2k}{i}^2(-1)^i = (-1)^k\binom{2k}{k}.$$
The proof uses generating functions. We define $a_i = \binom{2k}{i}$ and $b_i = \binom{2k}{i}(-1)^i$ and let
$$G(x) = \left(\sum_{i=0}^{2k} a_i x^i\right)\left(\sum_{i=0}^{2k} b_i x^i\right).$$
Let $\sum_{i\ge0} c_i x^i$ be the formal power series for $G$. By definition of $G$ we have
$$c_{2k} = \sum_{i=0}^{2k} a_i b_{2k-i} = \sum_{i=0}^{2k}\binom{2k}{i}\binom{2k}{2k-i}(-1)^{2k-i} = \sum_{i=0}^{2k}\binom{2k}{i}^2(-1)^i.$$
So it suffices to show that $c_{2k} = (-1)^k\binom{2k}{k}$. To do this we compute $G(x)$ explicitly. By the binomial
theorem $\sum_{i=0}^{2k}\binom{2k}{i}x^i = (1+x)^{2k}$ and $\sum_{i=0}^{2k}\binom{2k}{i}(-1)^i x^i = (1-x)^{2k}$, so
$$G(x) = (1+x)^{2k}(1-x)^{2k} = (1-x^2)^{2k}.$$
Applying the binomial theorem again we get
$$G(x) = \sum_{i=0}^{2k}\binom{2k}{i}(-1)^i x^{2i},$$
and thus $c_{2k} = (-1)^k\binom{2k}{k}$ as desired.
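The whole argument mirrors directly in code: convolve the two coefficient sequences as polynomials and compare the $x^{2k}$ coefficient with both sides of the identity (a sketch in the spirit of the lisp routine mentioned in the acknowledgements):

```python
from math import comb

def poly_mul(p, q):
    # multiply two polynomials given as coefficient lists (convolution)
    r = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

for k in range(1, 10):
    a = [comb(2 * k, i) for i in range(2 * k + 1)]               # (1 + x)^{2k}
    b = [comb(2 * k, i) * (-1) ** i for i in range(2 * k + 1)]   # (1 - x)^{2k}
    c = poly_mul(a, b)                                           # (1 - x^2)^{2k}
    assert c[2 * k] == sum(comb(2 * k, i) ** 2 * (-1) ** i for i in range(2 * k + 1))
    assert c[2 * k] == (-1) ** k * comb(2 * k, k)
```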